-
Notifications
You must be signed in to change notification settings - Fork 192
fix: Exclude /contentstorage/ URLs from Sharepoint Online Connector #3630
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Exclude /contentstorage/ URLs from Sharepoint Online Connector #3630
Conversation
💚 CLA has been signed |
buildkite test this |
Ok I signed the contributor agreement too, hopefully the cla-checker-service will do another pass. |
CLA checked is happy, but test coverage fell below 92% now - you can verify it by running I also see that you've added the check in a lot of places - is it needed? Theoretically, if you just ignore the site itself, you won't need to propagate the change down to all entities, or is it not the case? |
319658c
to
a433cff
Compare
Hi Artem - sorry for the long silence, between holidays and lots of other stuff I didn't have time to fix this. You're right, it should be enough to keep the check in the core places. I updated it. |
buildkite test this |
buildkite test this |
Just to check, what is |
Hello Artem, it's a special URL that Sharepoint uses to manage content for private channels in teams and other entities like loop. Even with sites.fullcontrol you won't be able to expand permissions or content for these URLs and they'll cause the connector to fail. They got mentioned a few times but there's very little available info overall. https://learn.microsoft.com/en-us/answers/questions/5219021/where-are-loop-workspace-pages-actually-stored https://learn.microsoft.com/en-us/sharepoint/dev/embedded/overview Because these containers are not “classic SharePoint sites/libraries” the way you might expect:
Thus, even if you have full control on a related “site” or “workspace,” those permissions don’t automatically bridge into the embedded / contentstorage container. |
Thanks for the writeout! For me it was important to check what API returns this entity - and it seems like the Then your code, that was also added only on the level of the sites, would exclude them and path I'm double checking because we've had similar experience with special Lists that Sharepoint has - they are available via |
I maybe would suggest that before we merge the updated change, i run a longer sync, say, for a day, to really be sure it doesn't happen anywhere. The previous version with the repeated statements worked 100% but now I am getting doubts that perhaps some paths would still return /contentstorage |
We'll appreciate if you try doing a long sync with this code, yes! |
Alternatively, if you can provide the details about how to do the same setup on our side, we'll be happy to reproduce and test ourselves |
I've asked ChatGPT how to trigger their creation, so you might be able to try it on your end: /contentstorage/... paths get created by apps that use SharePoint Embedded (Loop app, Copilot Pages/Notebooks, and any custom app built on Embedded). They’re not normal “sites,” so Teams private/shared channels won’t make them; those create regular channel sites. To force-create /contentstorage so your Elastic crawler can see them, the quickest repro is to create some Loop content (or a Copilot Page/Notebook). Details + sources below. How to deliberately create /contentstorage/... URLs Option A — Microsoft Loop (fastest, no code)
You won’t see it in the SPO UI, but if you open browser dev tools while the page loads you’ll see network calls to that /contentstorage/... path.  Even though the container is “headless,” its HTTP endpoints exist and are what Loop calls; that’s the same surface your site crawler stumbled onto in prod. There are public threads showing the access denied behavior on those URLs, which confirms they’re real but not user-surfaced.  Option B — Copilot Pages / Copilot Notebooks
These urls are app-owned, headless containers: permissions are governed by the owning app/Embedded, not standard site ACLs. You can hit the URLs, but expanding permissions via “Site permissions” won’t work, which matches what you’re seeing. Official docs and reputable write-ups confirm the storage model and the /contentstorage pattern for Loop.  |
That's a good writeup, thanks! I'll give it a test and get back to you |
I started a full sync now with the code from this PR, will let it run a few good hours, and see how it goes. So far so good 25k items in. |
In the meantime I was able to confirm it works on our tenant - I created a Copilot Page and it introduced a I've checked out to the changes from this PR and the sync ran successfully. I think it's safe to merge this PR as is and your sync should finish well too. Since you've started one, we can delay the merge until you confirm that all works. |
That's awesome Artem - the sync was still going, 150k items, by now it'd have failed already for sure, as i remember it was always failing at around 40k items after hitting some site starting with A... that had loop components etc. in it. I think we're good to go. Also nice because it means i can stop maintaining my own fork of the connector! |
…3630) ## Closes #3603 The change excludes specific /contentstorage/ urls from the sync in all API calls to Sharepoint. These URLs should not be attempted to be accessed as they're created internally by Sharepoint for Teams private channels, loop components, etc. (it's fairly undocumented what they're used for to be honest), and have a different permission model that will cause 401 errors and the connector to stop syncing. ## Checklists #### Pre-Review Checklist - [x] this PR does NOT contain credentials of any kind, such as API keys or username/passwords (double check `config.yml.example`) - [x] this PR has a meaningful title - [x] this PR links to all relevant github issues that it fixes or partially addresses - [x] this PR has a thorough description - [x] Tested the changes locally - [x] For bugfixes: backport safely to all minor branches still receiving patch releases ## Release Note Fixes an issue where a Sharepoint Online sync configured to crawl the entire tenant by selecting * in the site list, might stop with 401 errors when trying to access URLs containing /contentstorage/. --------- Co-authored-by: Artem Shelkovnikov <[email protected]>
…3630) ## Closes #3603 The change excludes specific /contentstorage/ urls from the sync in all API calls to Sharepoint. These URLs should not be attempted to be accessed as they're created internally by Sharepoint for Teams private channels, loop components, etc. (it's fairly undocumented what they're used for to be honest), and have a different permission model that will cause 401 errors and the connector to stop syncing. ## Checklists #### Pre-Review Checklist - [x] this PR does NOT contain credentials of any kind, such as API keys or username/passwords (double check `config.yml.example`) - [x] this PR has a meaningful title - [x] this PR links to all relevant github issues that it fixes or partially addresses - [x] this PR has a thorough description - [x] Tested the changes locally - [x] For bugfixes: backport safely to all minor branches still receiving patch releases ## Release Note Fixes an issue where a Sharepoint Online sync configured to crawl the entire tenant by selecting * in the site list, might stop with 401 errors when trying to access URLs containing /contentstorage/. --------- Co-authored-by: Artem Shelkovnikov <[email protected]>
…3630) ## Closes #3603 The change excludes specific /contentstorage/ urls from the sync in all API calls to Sharepoint. These URLs should not be attempted to be accessed as they're created internally by Sharepoint for Teams private channels, loop components, etc. (it's fairly undocumented what they're used for to be honest), and have a different permission model that will cause 401 errors and the connector to stop syncing. ## Checklists #### Pre-Review Checklist - [x] this PR does NOT contain credentials of any kind, such as API keys or username/passwords (double check `config.yml.example`) - [x] this PR has a meaningful title - [x] this PR links to all relevant github issues that it fixes or partially addresses - [x] this PR has a thorough description - [x] Tested the changes locally - [x] For bugfixes: backport safely to all minor branches still receiving patch releases ## Release Note Fixes an issue where a Sharepoint Online sync configured to crawl the entire tenant by selecting * in the site list, might stop with 401 errors when trying to access URLs containing /contentstorage/. --------- Co-authored-by: Artem Shelkovnikov <[email protected]>
…3630) ## Closes #3603 The change excludes specific /contentstorage/ urls from the sync in all API calls to Sharepoint. These URLs should not be attempted to be accessed as they're created internally by Sharepoint for Teams private channels, loop components, etc. (it's fairly undocumented what they're used for to be honest), and have a different permission model that will cause 401 errors and the connector to stop syncing. ## Checklists #### Pre-Review Checklist - [x] this PR does NOT contain credentials of any kind, such as API keys or username/passwords (double check `config.yml.example`) - [x] this PR has a meaningful title - [x] this PR links to all relevant github issues that it fixes or partially addresses - [x] this PR has a thorough description - [x] Tested the changes locally - [x] For bugfixes: backport safely to all minor branches still receiving patch releases ## Release Note Fixes an issue where a Sharepoint Online sync configured to crawl the entire tenant by selecting * in the site list, might stop with 401 errors when trying to access URLs containing /contentstorage/. --------- Co-authored-by: Artem Shelkovnikov <[email protected]>
…3630) ## Closes #3603 The change excludes specific /contentstorage/ urls from the sync in all API calls to Sharepoint. These URLs should not be attempted to be accessed as they're created internally by Sharepoint for Teams private channels, loop components, etc. (it's fairly undocumented what they're used for to be honest), and have a different permission model that will cause 401 errors and the connector to stop syncing. ## Checklists #### Pre-Review Checklist - [x] this PR does NOT contain credentials of any kind, such as API keys or username/passwords (double check `config.yml.example`) - [x] this PR has a meaningful title - [x] this PR links to all relevant github issues that it fixes or partially addresses - [x] this PR has a thorough description - [x] Tested the changes locally - [x] For bugfixes: backport safely to all minor branches still receiving patch releases ## Release Note Fixes an issue where a Sharepoint Online sync configured to crawl the entire tenant by selecting * in the site list, might stop with 401 errors when trying to access URLs containing /contentstorage/. --------- Co-authored-by: Artem Shelkovnikov <[email protected]>
Thank you for your contribution, @maxesse! I've set everything to backport the changes you've done to the branches that we'll be releasing - it'll be available in Docker as soon as release happens, or from source the moment we merge all the backports. |
That's awesome Artem, happy to help! |
…ector (#3630) (#3788) Backports the following commits to 8.18: - fix: Exclude /contentstorage/ URLs from Sharepoint Online Connector (#3630) Co-authored-by: Max Sanna <[email protected]> Co-authored-by: Artem Shelkovnikov <[email protected]>
…ector (#3630) (#3790) Backports the following commits to 8.19: - fix: Exclude /contentstorage/ URLs from Sharepoint Online Connector (#3630) Co-authored-by: Max Sanna <[email protected]> Co-authored-by: Artem Shelkovnikov <[email protected]>
…ctor (#3630) (#3791) Backports the following commits to 9.2: - fix: Exclude /contentstorage/ URLs from Sharepoint Online Connector (#3630) --------- Co-authored-by: Max Sanna <[email protected]> Co-authored-by: Artem Shelkovnikov <[email protected]> Co-authored-by: Elastic Machine <[email protected]>
Closes #3603
The change excludes specific /contentstorage/ urls from the sync in all API calls to Sharepoint. These URLs should not be attempted to be accessed as they're created internally by Sharepoint for Teams private channels, loop components, etc. (it's fairly undocumented what they're used for to be honest), and have a different permission model that will cause 401 errors and the connector to stop syncing.
Checklists
Pre-Review Checklist
config.yml.example
)Release Note
Fixes an issue where a Sharepoint Online sync configured to crawl the entire tenant by selecting * in the site list, might stop with 401 errors when trying to access URLs containing /contentstorage/.